Apr 20, 2023 | Read time 5 min

Introducing Real-Time Translation: Breaking Down Language Barriers

Speechmatics are proud to announce our real-time voice translation service. Building on our existing best-in-class speech-to-text, we can offer highly accurate real-time translation through a single speech API. Try it out today!
Caroline Dockes, Machine Learning Engineer

Following the release of batch translation in February, real-time translation is now available in our SaaS offering. We provide translation of speech to and from English for 34 languages, tightly integrated with our high-accuracy transcription through a single real-time or batch API. Customers can start using this through our API; further details on how to use it can be found in our docs.
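As a rough illustration, a real-time session that requests both an English transcript and a German translation might be configured along the following lines. This is only a sketch: the exact message and field names are defined in our docs, and the values below (audio format, languages) are placeholders.

```python
# Illustrative sketch only: check the docs for the exact real-time message schema.
# A session is configured with a single start message that requests both
# transcription and translation of the incoming audio.
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_s16le",   # 16-bit PCM audio (placeholder)
        "sample_rate": 16000,
    },
    "transcription_config": {
        "language": "en",          # language spoken in the audio
    },
    "translation_config": {
        "target_languages": ["de"],  # translate the transcript into German
    },
}
```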

You can see a live demo with a select few languages below:

Our translation builds on top of our state-of-the-art speech-to-text system and benefits from the substantial improvement in transcription accuracy offered by the Ursa generation of models. We previously showed how the quality of ASR impacts various downstream tasks. Here we discuss this in the context of translation.

Translation cannot recover from breakdowns in transcription

Unsurprisingly, when transcription breaks down, it is impossible for translation to recover the meaning of the original sentence. Here are some examples from the CoVoST2[1] test set:

|  | Transcription | Translation |
| --- | --- | --- |
| Google | Gets truly how Mr. Your creators. | Ruft sind wirklich wie Mr. Your Creators. |
| Speechmatics | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Kreaturen. |
| Reference | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Geschöpfe. |


|  | Transcription | Translation |
| --- | --- | --- |
| Google | the Sheep at 13 him that | das Schaf um 13 das beigebracht |
| Speechmatics | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |
| Reference | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |

|  | Transcription | Translation |
| --- | --- | --- |
| Google | No activo la frika tropical. | I do not activate the tropical freak. |
| Speechmatics | Es Nativo de África tropical. | Native of tropical Africa |
| Reference | Es nativo del África tropical. | Native of tropical Africa. |

Of course, the examples above are rather extreme, but we find that even small mistakes from transcription can have a large impact on the resulting translation. Here is an example:

|  | Transcription | Translation |
| --- | --- | --- |
| Google | Elle croit en Tanzanie. | She believes in Tanzania. |
| Speechmatics | Elle croît en Tanzanie. | It is growing in Tanzania. |
| Reference | Elle croît en Tanzanie. | It grows in Tanzania |

In this context, the French word "croit" means "believes", while "croît" means "grows". However, the two are pronounced exactly the same! From the perspective of transcription, substituting one for the other is a minor mistake. Still, as you can see from the Google translation, the mistake causes the English translation to entirely lose the meaning of the original sentence.

Word Error Rates and BLEU Scores

Evaluating the two systems more systematically, we observe that Speechmatics’ lower Word Error Rates (WERs) are associated with higher average BLEU scores on the CoVoST2 test set. BLEU[2] is one of the most commonly used automatic metrics for translation quality. It measures the overlap (in terms of words) between the machine-generated translation and one or more human-generated references.
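For reference, the two metrics can be summarised as follows, using their standard definitions rather than any implementation-specific variant:

```latex
% Word Error Rate: substitutions (S), deletions (D) and insertions (I)
% relative to the number of words N in the reference transcript.
\mathrm{WER} = \frac{S + D + I}{N}

% BLEU: geometric mean of modified n-gram precisions p_n (typically up to N = 4
% with uniform weights w_n = 1/N), scaled by a brevity penalty BP that penalises
% hypotheses shorter than the reference (c = hypothesis length, r = reference length).
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} w_n \log p_n \right),
\qquad
\mathrm{BP} = \min\!\left( 1,\; e^{\,1 - r/c} \right)
```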

Figure 1: Transcription Word Error Rate (WER) from Google and Speechmatics on the CoVoST2 speech translation test set. Lower scores are better.

Figure 2: BiLingual Evaluation Understudy (BLEU) scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.

Beyond BLEU Scores

BLEU scores are a convenient way to measure translation quality because they can be computed easily and in a standardized way. However, they also have limitations. They penalize any deviation from the reference translation, even one that preserves the meaning and is equally fluent. They also put the same weight on every word, even though a single word can flip the meaning of an entire sentence (e.g. “not”).

Here is an example that illustrates the limitations of BLEU:

|  | Transcription | Translation | BLEU Score* |
| --- | --- | --- | --- |
| Google | Comme partons-nous pour faire. | How are we going to do. | 41.11 |
| Speechmatics | Quand partons nous pour Ferrare? | When do we leave for Ferrara? | 8.64 |
| Reference | Quand partons-nous pour Ferrare? | When are we going to Ferrare? | 100.00 |

*BLEU is a corpus-based metric and isn’t generally used to evaluate individual sentences. We only include sentence-level BLEU scores here for illustration.

The Speechmatics hypothesis substitutes words 2, 4, 5 and 6. The Google hypothesis substitutes only words 1 and 6. From the point of view of BLEU, the latter is strictly better, despite the fact that the Speechmatics hypothesis matches the meaning of the reference translation much more closely.
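To make the sentence-level scores above concrete, here is a minimal sketch of how they could be computed with the open-source sacrebleu library. The exact numbers depend on the tokenizer and smoothing settings, so they will not necessarily match the table.

```python
# Minimal sketch using sacrebleu (pip install sacrebleu). Sentence-level scores
# depend on tokenizer and smoothing settings, so they need not match the table.
import sacrebleu

reference = "When are we going to Ferrare?"
hypotheses = {
    "Google": "How are we going to do.",
    "Speechmatics": "When do we leave for Ferrara?",
}

for system, hypothesis in hypotheses.items():
    # sentence_bleu takes a hypothesis string and a list of reference strings
    score = sacrebleu.sentence_bleu(hypothesis, [reference])
    print(f"{system}: {score.score:.2f}")
```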

In response to the limitations of BLEU scores, people have tried to find better metrics of translation quality, ones that align more closely with human judgement. One such metric is the COMET score[3] submitted to the WMT20 Metrics Shared Task by Unbabel. It relies on a pretrained multilingual encoder, XLM-RoBERTa[4], to embed the source text, the reference text, and the translation hypothesis into a shared feature space. These representations are then fed to a feed-forward network trained to predict human-generated quality assessments. While the absolute values of the scores are hard to interpret, the authors of [3] show that they correlate better with human judgements than BLEU scores, indicating that they are a more meaningful way to rank different systems.
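For readers who want to experiment with COMET themselves, the sketch below uses the open-source unbabel-comet package. The checkpoint name is just an example of a publicly available model (the evaluation in this post used the WMT20 metric referenced above), and the predict interface may vary slightly between library versions.

```python
# Minimal sketch using the open-source unbabel-comet package
# (pip install unbabel-comet). The checkpoint name is an example of a publicly
# available COMET model; the evaluation in this post used the WMT20 metric.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-comet-da")  # example checkpoint
model = load_from_checkpoint(model_path)

data = [
    {
        "src": "Elle croît en Tanzanie.",    # source text
        "mt": "It is growing in Tanzania.",  # machine translation hypothesis
        "ref": "It grows in Tanzania.",      # human reference translation
    }
]

# predict() returns segment-level scores and a corpus-level system score
output = model.predict(data, batch_size=8, gpus=0)
print(output.system_score)
```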

Looking at COMET scores on the CoVoST2 test set, we again find that Speechmatics outperforms Google.

Figure 3: COMET scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.

Looking beyond WER and BLEU scores to COMET scores also highlights the importance of capitalization and punctuation. In the following example, the Speechmatics and Google transcriptions each have one substitution and one insertion. Neither gets the tricky proper noun “Makololos” right, but the capitalization in the Speechmatics hypothesis helps preserve the original sentence’s meaning in the translation.

|  | Transcription | Translation |
| --- | --- | --- |
| Google | Pas une trace de ma coloros | Not a trace of my color |
| Speechmatics | Pas une trace de Mako Lolo. | Not a trace of Mako Lolo. |
| Reference | Pas une trace de Makololos. | No sign of Makololos. |

Challenges of Real-Time Translation

Delivering a high-quality real-time translation system poses several challenges beyond translation quality. For one, we would like to minimize the delay between when a word is spoken and when the corresponding translation is returned. However, different languages have very different rules about word order, which can make this tricky. For example, German sentences often have the verb at the end. In order to translate such a sentence into English, we have to wait until the end of the sentence; we cannot do it incrementally. Waiting for the end of the sentence also means we need a high-quality punctuation model to signal where sentences end. Striking the right balance between gathering enough context for high-quality translation and minimizing delay is something we are still actively working on.
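To make this concrete with an illustrative sentence of our own (not from the test set): in “Ich habe das Buch gestern gelesen”, the main verb “gelesen” (“read”) only arrives at the very end, yet the English translation “I read the book yesterday” needs that verb as its second word. The system therefore cannot commit to an English rendering until the German sentence is complete, and it needs a reliable end-of-sentence signal to know when that is.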

Conclusion

Real-time translation is a new area for us, but we are excited that our strong foundation in ASR enables us to offer a competitive system, one which we expect to keep improving in line with our transcription accuracy. In the coming months, we plan to roll out more APIs built on our ASR system, and we hope that these will also benefit from our state-of-the-art word error rates.

References

[1] Wang, C., et al. "CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus." arXiv:2007.10310 (2020).

[2] Papineni, K., et al. "BLEU: a Method for Automatic Evaluation of Machine Translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (2002).

[3] Rei, R., et al. "Unbabel’s Participation in the WMT20 Metrics Shared Task." Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics (2020).

[4] Conneau, A., et al. "Unsupervised Cross-lingual Representation Learning at Scale." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics (2020).

Author: Caroline Dockes
Acknowledgements: Ana Olssen, Andrew Innes, Benedetta Cevoli, Chris Waple, Dominik Jochec, Dumitru Gutu, Georgina Robertson, James Gilmore, John Hughes, Markus Hennerbichler, Nelson Kondia, Nick Gerig, Owais Aamir Thungalwadi, Owen O'Loan, Stuart Wood, Tom Young, Tomasz Swider, Tudor Evans, Venkatesh Chandran, Vignesh Umapathy and Yahia Abaza.